Running head: DESCRIPTIVE STATISTICS FOR MODERN SCORE DISTRIBUTIONS

Descriptive Statistics for Modern Test Score Distributions: Skewness, Kurtosis, Discreteness, and Ceiling Effects

Authors

  • Andrew D. Ho
  • Carol C. Yu
Abstract

Many statistical analyses benefit from the assumption that unconditional or conditional distributions are continuous and normal. Over fifty years ago in this journal, Lord (1955) and Cook (1959) chronicled departures from normality in educational tests, and Micceri (1989) similarly showed that the normality assumption is rarely met in educational and psychological practice. In this paper, the authors extend these previous analyses to state-level educational test score distributions that are an increasingly common target of high-stakes analysis and interpretation. Among 504 scale-score and raw-score distributions from state testing programs from recent years, non-normal distributions are common and are often associated with particular state programs. The authors explain how scaling procedures from Item Response Theory lead to non-normal distributions as well as unusual patterns of discreteness. The authors recommend that distributional descriptive statistics be calculated routinely to inform model selection for large-scale test score data, providing warnings in the form of sensitivity studies that compare baseline results to those from normalized score scales.

Introduction

Normality is a useful assumption in many modeling frameworks, including the general linear model, which is well known to assume normally distributed residuals, and structural equation modeling, where normal-theory-based maximum likelihood estimation is a common starting point (e.g., Bollen, 1989). There is a vast literature that describes consequences of violating normality assumptions in various modeling frameworks and for their associated statistical tests. A similarly substantial literature has introduced alternative frameworks and tests that are robust or invariant to violations of normality assumptions.
A classic, constrained example of such a topic is the sensitivity of the independent-samples t-test to normality assumptions (e.g., Boneau, 1960), where violations of normality may motivate a robust or nonparametric alternative (e.g., Mann & Whitney, 1947). An essential assumption that underlies this kind of research is that the degree of non-normality in real-world distributions is sufficient to threaten the interpretation in which the researcher is most interested. If most distributions in a particular area of application are normal, then illustrating consequences of non-normality and motivating alternative frameworks may be interesting theoretically but of limited practical importance. To discount this possibility, researchers generally include a real-world example of non-normal data, or they at least simulate data from non-normal distributions that share features with real-world data.

Nonetheless, comprehensive reviews of the non-normality of data in educational and psychological applications are rare. Almost sixty years ago in this journal, Lord (1955) reviewed the skewness and kurtosis of 48 aptitude, admissions, and certification tests. He found that test score distributions were generally negatively skewed and platykurtic. Cook (1959) replicated Lord’s analysis with 50 classroom tests. Micceri (1989) gathered 440 distributions, 176 of these from large-scale educational tests, and he described 29% of the 440 as moderately asymmetric and 31% of the 440 as extremely asymmetric. He also observed that all 440 of his distributions were non-normal as indicated by repeated application of the Kolmogorov-Smirnov test (p < .01).

In this paper, we provide the first review that we have found of the descriptive features of state-level educational test score distributions.
We are motivated by the increasing use of these data for both research and high-stakes inferences about students, teachers, administrators, schools, and policies. These data are often stored in longitudinal data structures (e.g., U.S. Department of Education, 2011) that laudably lower the barriers to the analysis of educational test score data. However, as we demonstrate, these distributions have features that can threaten conventional analyses and the interpretations drawn from them, and casual application of familiar parametric models may lead to unwarranted inferences. Such a statement is necessarily conditional on the model and the desired inferences.

We take the Micceri (1989) finding for granted in our data: these distributions are not normal. At our state-level sample sizes, we can easily reject the null hypothesis that distributions are normal, but this is hardly surprising. The important questions concern the magnitude of non-normality and the consequences for particular models and inferences. We address the question of magnitude in depth by presenting skewness, kurtosis, and discreteness indices for 504 raw and scale score distributions from state testing programs. Skewness and kurtosis are well-established descriptive statistics for distributions (Pearson, 1895) and are occasionally used as benchmarks for non-normality (e.g., Bulmer, 1979).

We illustrate the consequences of non-normality only partially. This is deliberate. A complete review of all possible analyses and consequences is impossible given space restrictions. Thus, our primary goal is to make the basic features of test score distributions easily describable and widely known. These features may guide simulation studies for future investigations of the consequences of violating model assumptions.
Additionally, if variability in these features is considerable, this motivates researchers to use an arsenal of diverse methods to achieve their aims, with which they might manage tradeoffs between Type I and Type II errors, as well as bias and efficiency.

We have two secondary goals. First, we provide illustrative examples of how these features can lead to consequential differences for model results, so that researchers fitting their own models to these data may better anticipate whether problems may arise. Second, we explain the pathology of the non-normality that we observe in the data. We demonstrate that, when test score distributions are cast as the results of particular models and scaling procedures, their non-normal features should not be surprising. We emphasize that non-normality is not inherently incorrect or flawed; it is the responsibility of the researcher to fit the model to the data, not the reverse. However, if the resulting features are undesirable for the primary intended use of the test scores, then the procedures that generate the distributions should be reconsidered.

To accomplish these multiple goals, we present this paper in four parts. The first part introduces the pathology of non-normality by describing skewness and kurtosis of raw score distributions as natural properties of the binomial and beta-binomial distributions. This analysis of raw scores serves as a conceptual link to the Lord (1955), Cook (1959), and Micceri (1989) analyses, which were predominantly raw-score-based, and sets a historical baseline from which to evaluate modern uses of test scores, which are scale-score-based, that is, based on scores developed using a particular scaling method. In all of the cases presented here, this scaling method is Item Response Theory (IRT; see Lord, 1980).
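The link between raw scores and the binomial distribution can be illustrated with a minimal sketch. It assumes the simplest possible case, which is not taken from the paper: a raw score that is the number correct on `n_items` independent items, each answered correctly with the same probability `p`. The function names `binomial_shape` and `pmf_shape` are hypothetical. The first uses the standard closed-form binomial skewness and kurtosis; the second recomputes both from the probability mass function as a check.

```python
from math import comb, sqrt


def binomial_shape(n_items, p):
    """Closed-form skewness and (non-excess) kurtosis of Binomial(n_items, p)."""
    q = 1.0 - p
    var = n_items * p * q
    skew = (q - p) / sqrt(var)
    kurt = 3.0 + (1.0 - 6.0 * p * q) / var
    return skew, kurt


def pmf_shape(n_items, p):
    """Same quantities by direct summation over the probability mass function."""
    pmf = [comb(n_items, k) * p**k * (1.0 - p) ** (n_items - k)
           for k in range(n_items + 1)]
    mean = sum(k * w for k, w in enumerate(pmf))
    m2 = sum((k - mean) ** 2 * w for k, w in enumerate(pmf))
    m3 = sum((k - mean) ** 3 * w for k, w in enumerate(pmf))
    m4 = sum((k - mean) ** 4 * w for k, w in enumerate(pmf))
    return m3 / m2**1.5, m4 / m2**2


# Hypothetical 40-item tests of varying difficulty (assumed values).
for p in (0.5, 0.65, 0.8):
    skew, kurt = binomial_shape(40, p)
    print(f"p={p:.2f}  skewness={skew:+.3f}  kurtosis={kurt:.3f}")
```

Under this assumed model, an easy test (`p` above 0.5) has negative skewness and a test of middling difficulty has kurtosis slightly below 3. The beta-binomial case, in which `p` varies across examinees, is not covered by this sketch.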
In the second part, we present the primary results of the paper in terms of skewness and kurtosis of both raw score and scale score distributions. We continue to emphasize the pathology of non-normality by describing the differences between features of raw score and scale score distributions as a consequence of scaling methods using IRT. The third part of the paper uses visual inspection of test score distributions to motivate additional descriptive statistics for scale score distributions. In particular, we motivate statistics that describe the discreteness of distributions, and we show that visually obvious “ceiling effects” in some distributions are better identified by discreteness than by skewness. Here, we begin to illustrate simple consequences that threaten common uses of test scores, including comparing student scores, selecting students into programs, and evaluating student growth.

Finally, in the fourth part of the paper, we describe the possible impact of non-normality on the results of regression-based test score analyses. We illustrate this with results from predictive models for students and a “value-added”-type model for school-level scores. For these sensitivity studies, we compare results estimated using observed distributions to results estimated using their normalized counterparts. This answers the question: if these distributions were normal, how would our interpretations differ? Taken together, these arguments build toward a familiar but important principle: researchers should select models with knowledge of the process that generated the data and informed by data features, or else risk flawed analyses, interpretations, and decisions.

Data

A search of publicly available state-level test-score distributions yielded 330 scale score distributions and 174 raw score distributions from 14 different state testing programs in the academic years ending 2010 and 2011.
We constrained the time period to these two years because these state testing programs were fairly stable at that time and because all states had some data in both years. One of the programs, the New England Common Assessment Program (NECAP), represents 4 states (Maine, New Hampshire, Rhode Island, and Vermont); thus, the data represent 17 states in total. As Micceri (1989) noted, the data appropriate for full distributional analyses are rarely publicly available; however, these states have considerable regional coverage across the United States. We collected distributions from 6 grades (3-8) and 2 subjects (Reading/English Language Arts and Mathematics), for a total of 12 possible scale score distributions per year. Raw score distributions were available from 8 of the 14 state testing programs. We use raw scores largely for illustration, as a historical reference point to previous research, and as a conceptual reference point to emphasize the consequences of IRT scaling on skewness and kurtosis. Since scores are generally reported and analyzed using scale scores, the latter half of the paper focuses on scale scores. Nebraska is missing Mathematics score distributions in 2010, and Oklahoma is missing raw score distributions in 2010.

Table 1 shows the minimum and maximum number of examinees across the available score distributions in each state. Unsurprisingly, these numbers reflect the relative size of each state's youth population. The distribution with the highest n-count is from Texas with 689,938 examinees, and the distribution with the lowest n-count is from South Dakota with 8,982 examinees. Altogether, the distributions represent data from over 31 million examinees. In the last two columns, Table 1 shows the minimum and maximum number of discrete score points in the scale score distributions.
Distributions from Colorado are outliers in terms of their counts of discrete score points, with counts around 500 due to their practice of scoring based on patterns of item responses rather than summed scores (CTB-McGraw Hill, 2010a, 2011a). Distributions from all other states have discrete score point counts ranging from 28 (New York, Grade 5, English Language Arts, in 2010) to 93 (Texas, Grade 7, Reading, in 2011).

Skewness and Kurtosis

We use skewness and kurtosis as rough indicators of the degree of normality of distributions or the lack thereof. Unlike test statistics from normality testing procedures like the Kolmogorov-Smirnov D or the Shapiro-Wilk W, skewness and kurtosis are used here like an effect size, to communicate the degree of non-normality, rather than statistical significance under some null hypothesis of normality. The use of skewness and kurtosis to describe distributions dates back to Pearson (1895) and has been reviewed more recently by Moors (1986), D’Agostino, Belanger, and D’Agostino (1990), and DeCarlo (1997).

Skewness is a rough index of the asymmetry of a distribution, where positive skewness in unimodal distributions suggests relatively plentiful and/or extreme positive values, and negative skewness suggests the same for negative values. Skewness is an estimate of the third standardized moment of the population distribution,

$$ s = \frac{\sqrt{n(n-1)}}{n-2} \cdot \frac{\frac{1}{n}\sum_i (x_i - \bar{x})^3}{\left[\frac{1}{n}\sum_i (x_i - \bar{x})^2\right]^{3/2}}. $$

Skewness can range from −∞ to +∞, and symmetric distributions like the normal distribution have a skewness of 0. The n-based bias correction term for the population estimate is negligible for the large samples that we have here, but we include it for completeness.

Kurtosis is an estimate of the fourth standardized moment of the population distribution,

$$ k = \frac{n(n+1)(n-1)}{(n-2)(n-3)} \cdot \frac{\sum_i (x_i - \bar{x})^4}{\left[\sum_i (x_i - \bar{x})^2\right]^2}. $$

Kurtosis can range from 1 to +∞.
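The two estimators can be sketched in plain Python as follows. This is a minimal illustration that follows the formulas as they appear in the text above, not the authors' own code; the function names `sample_skewness` and `sample_kurtosis` and the example scores are assumptions made for the sketch.

```python
from math import sqrt


def sample_skewness(x):
    """Bias-corrected sample skewness s, as written in the text."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in x) / n  # third central moment
    return sqrt(n * (n - 1)) / (n - 2) * m3 / m2**1.5


def sample_kurtosis(x):
    """Sample kurtosis k, as written in the text (a normal sample gives about 3)."""
    n = len(x)
    mean = sum(x) / n
    ss2 = sum((v - mean) ** 2 for v in x)  # sum of squared deviations
    ss4 = sum((v - mean) ** 4 for v in x)  # sum of fourth-power deviations
    return n * (n + 1) * (n - 1) / ((n - 2) * (n - 3)) * ss4 / ss2**2


# Hypothetical scale scores piled up near a maximum of 60 (assumed data).
scores = [60] * 300 + [55] * 250 + [50] * 200 + [40] * 150 + [25] * 100
print(sample_skewness(scores), sample_kurtosis(scores))
```

Because the leading factors approach 1 and n respectively as n grows, the results for samples of the size described here are practically indistinguishable from the uncorrected moment ratios.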
The kurtosis of a normal distribution is 3. Although it is common to subtract 3 from k and describe this as “excess kurtosis,” beyond that expected from a normal …


Similar resources

Exp-Kumaraswamy Distributions: Some Properties and Applications

In this paper, we propose and study the exp-Kumaraswamy distribution. Some of its properties are derived, including the density function, hazard rate function, quantile function, moments, skewness and kurtosis. A data set is used to illustrate an application of the proposed distribution. Also, we obtain a new distribution by transformation on the exp-Kumaraswamy distribution. New distribution is an...


Adjusting the tests for skewness and kurtosis for distributional misspecifications

The standard root−b1 test is widely used for testing skewness. However, several studies have demonstrated that this test is not reliable for discriminating between symmetric and asymmetric distributions in the presence of excess kurtosis. The main reason for the failure of the standard root−b1 test is that its variance formula is derived under the assumption of no excess kurtosis. In this paper...


Behaviour of skewness, kurtosis and normality tests in long memory data

We establish the limiting distributions for empirical estimators of the coefficient of skewness, kurtosis, and the Jarque–Bera normality test statistic for long memory linear processes. We show that these estimators, contrary to the case of short memory, are neither √n-consistent nor asymptotically normal. The normalizations needed to obtain the limiting distributions depend on the long memory...


Univariate and multivariate skewness and kurtosis for measuring nonnormality: Prevalence, influence and estimation.

Nonnormality of univariate data has been extensively examined previously (Blanca et al., Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, 9(2), 78-84, 2013; Micceri, Psychological Bulletin, 105(1), 156, 1989). However, less is known of the potential nonnormality of multivariate data although multivariate analysis is commonly used in psychological and edu...


The Weibull Topp-Leone Generated Family of Distributions: Statistical Properties and Applications

Statistical distributions are very useful in describing and predicting real world phenomena. Consequently, the choice of the most suitable statistical distribution for modeling given data is very important. In this paper, we propose a new class of lifetime distributions called the Weibull Topp-Leone Generated (WTLG) family. The proposed family is constructed via compounding the Weibull and the ...



Publication date: 2014